This lab explores the dog licensing data set: [https://data.cityofnewyork.us/Health/NYC-Dog-Licensing-Dataset/nu7n-tubp/data](https://data.cityofnewyork.us/Health/NYC-Dog-Licensing-Dataset/nu7n-tubp/data)

Before downloading the data set as a CSV file, filter it so that it contains only entries whose license expires after today (Aug. 29).

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline

Read the CSV file into a dataframe called `dogs`, and display the dataframe to make sure it was loaded correctly.
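
As a minimal sketch of this step, the cell below loads a tiny made-up CSV from memory instead of the real download, so it runs without the file. The rows are invented for illustration, and `dogs.csv` in the comment is a hypothetical file name; use whatever name your download has.

```python
import io

import pandas as pd

# Made-up rows standing in for the downloaded file (illustration only).
sample_csv = io.StringIO(
    "AnimalName,AnimalGender,BreedName,LicenseIssuedDate,LicenseExpiredDate\n"
    "BELLA,F,Chihuahua,08/15/2018,08/15/2020\n"
    "MAX,M,Yorkshire Terrier,09/02/2018,09/02/2019\n"
)

# With the real data, pass the path of the downloaded file instead,
# e.g. pd.read_csv("dogs.csv") ("dogs.csv" is a hypothetical name).
dogs = pd.read_csv(sample_csv)

# In a notebook, ending the cell with the dataframe displays it.
dogs.head()
```

Checking `dogs.shape` and `dogs.columns` is a quick way to confirm that every row and column arrived.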

Two of the columns are dates, so make them into `datetime` objects.
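
One way to do the conversion is `pd.to_datetime`, sketched here on a small made-up dataframe. The column names match the ones used later in this lab; the month/day/year date format is an assumption about how the download is written, so check a few rows of your own file.

```python
import pandas as pd

# Made-up date strings (assumed month/day/year, illustration only).
dogs_demo = pd.DataFrame({
    "LicenseIssuedDate": ["08/15/2018", "09/02/2018"],
    "LicenseExpiredDate": ["08/15/2020", "09/02/2019"],
})

# Convert each date column from strings to datetime values.
for col in ["LicenseIssuedDate", "LicenseExpiredDate"]:
    dogs_demo[col] = pd.to_datetime(dogs_demo[col])

# The dtypes should now be datetime64 rather than object.
dogs_demo.dtypes
```

Once a column holds datetimes, the `.dt` accessor (for example `.dt.month` or `.dt.year`) becomes available on it.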

Get an overview of the dataset by calling the `describe()` method.

1. Are the summary statistics (mean, standard deviation, min, etc.) meaningful for all of these columns?
2. Are we missing any data?

Answers:
1. No. The only column they might be meaningful for is `Extract Year`.
2. Yes. Notice that we have no borough information, even though there is a column for it. Also, some of the columns are missing from the `describe()` output!

To display summary statistics for the columns with non-numeric data, pass `include=["O"]` (the object dtype) to `describe()`:

In [None]:
dogs.describe(include=["O"])

What's the most common dog name?

To visualize the distribution of a qualitative or categorical column, we will make a bar chart (as in MAT 128). Let's do this for the `AnimalGender` column.

First get the value counts.

In [None]:
gender_counts = dogs['AnimalGender'].value_counts()
gender_counts

Is the split between male and female dogs roughly 50/50?

Next we plot the counts:

In [None]:
gender_counts.plot.bar()

Add a title and axis labels. Remember to repeat the original plotting command in the same cell so that the labels are applied to that plot.
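
A minimal sketch of one way to label a bar chart, using made-up counts in place of `gender_counts`. The title and label strings are placeholders; the pandas plotting call returns a matplotlib `Axes` object, and the labels are set on that.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up counts standing in for gender_counts (illustration only).
demo_counts = pd.Series({"M": 3, "F": 2})

# Series.plot.bar() returns the Axes it drew on.
ax = demo_counts.plot.bar()

# Placeholder text; choose wording that describes your own plot.
ax.set_title("Licensed dogs by gender")
ax.set_xlabel("Gender")
ax.set_ylabel("Number of licenses")
```

Calling `plt.title(...)`, `plt.xlabel(...)`, and `plt.ylabel(...)` right after the plotting line in the same cell is an equivalent alternative.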

Let's plot the distribution of another categorical column, `BreedName`.

Is this plot as readable as the gender one?

We can fix this by just plotting the top 10 breeds using filtering. First, look at the counts to see what the cut-off should be.
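
As a small sketch of inspecting the counts, the cell below uses a made-up list of breeds. `value_counts()` sorts from most to least common by default, so `head(n)` shows the `n` largest counts; the names `breeds_demo` and `demo_breed_counts` are illustrative only.

```python
import pandas as pd

# Made-up breed column (illustration only).
breeds_demo = pd.Series(
    ["Chihuahua", "Chihuahua", "Chihuahua", "Shih Tzu", "Shih Tzu", "Beagle"]
)

# Counts come back sorted in descending order.
demo_breed_counts = breeds_demo.value_counts()

# Keep the two most common breeds; on the real data, head(10) would
# show where the tenth-largest count falls.
top_two = demo_breed_counts.head(2)
top_two
```

Reading off the smallest count among the rows shown by `head(10)` suggests a cut-off value for the filter.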

It looks like 1000 would be a good cut-off. We can do the filter one of two ways:

In [None]:
top_filter = breed_counts >= 1000
breed_counts[top_filter].plot.bar()

In [None]:
breed_counts[breed_counts >= 1000].plot.bar()

What's the most popular breed of dog?

In Lab 1, we filtered by only a single criterion, but we can filter by multiple criteria using `&` (and) and `|` (or). For example, to count the number of dog licenses issued in August to Chihuahuas:

In [None]:
aug_chihuahua_filter = (dogs["BreedName"] == "Chihuahua") & (dogs["LicenseIssuedDate"].dt.month == 8)
len(dogs[aug_chihuahua_filter])

Or, to count the dogs that are named BELLA or whose license expires in 2020:

In [None]:
bella_2020_filter = (dogs["AnimalName"] == "BELLA") | (dogs["LicenseExpiredDate"].dt.year == 2020)
len(dogs[bella_2020_filter])
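
To see how `&` and `|` combine conditions, here is a small sketch on four made-up rows where the answers can be checked by eye. Each comparison is wrapped in parentheses (or stored in its own variable) because `&` and `|` bind more tightly than `==` in Python.

```python
import pandas as pd

# Four made-up dogs (illustration only).
demo = pd.DataFrame({
    "AnimalName": ["BELLA", "MAX", "BELLA", "ROCKY"],
    "BreedName": ["Chihuahua", "Chihuahua", "Beagle", "Beagle"],
})

# Each comparison produces a boolean Series, one True/False per row.
is_bella = demo["AnimalName"] == "BELLA"
is_chihuahua = demo["BreedName"] == "Chihuahua"

# & keeps rows where both conditions hold; | keeps rows where at least one holds.
both = (is_bella & is_chihuahua).sum()    # only the first row
either = (is_bella | is_chihuahua).sum()  # every row except ROCKY the Beagle
```

Summing a boolean Series counts its `True` values, so it gives the same number as `len(demo[filter])`.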

With a partner or by yourself, answer the following questions:

1. How many dogs are registered in the Lehman zip code (10468)? 

2. Plot a bar chart of the number of dog licenses in the top 10 zip codes. Which zip code has the most dog licenses, and where is it located (type the zip code into Google Maps to see the area)?

3. How many female Labrador Retrievers are licensed?

4. How many licenses are due to expire by the end of September?

5. Ask and answer your own question(s) about this dataset.

Challenge question: Can you find the oldest dog in the dataset?